(57) ABSTRACT

Concept searching using a Boolean or keyword search engine. Documents are preprocessed before being passed to a search engine by identifying, on a word-by-word basis, the "word tokens" contained in the document. Once the word tokens have been extracted, each word token is referenced in a concept database that maps word tokens to concept identifiers. The concept identifiers associated with the word tokens are converted into unique non-word concept tokens and arranged into a list. The list is then inserted into the document as invisible but searchable text. The document is then transferred to the server monitored by the search engine. Search queries are preprocessed before being passed to the search engine in the same manner. The query is first broken into word tokens and the word tokens are then referenced in the concept database. All associated concept identifiers are retrieved and converted to unique concept tokens. The concept tokens are then combined into a string and sent to the search engine as an ordinary query.
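
As a rough illustration of the document side of this pipeline, the Python sketch below turns a weighted set of concept tokens into invisible but searchable text. It assumes the HTML META tag form that the detailed description gives as one embodiment (NAME "nyms", CONTENT a space-separated token list). The helper names `build_meta_tag` and `embed` are hypothetical, and placing the tag directly after `<head>` is only one possible choice.

```python
def build_meta_tag(weighted_tokens):
    """Order concept tokens by descending weight and format them as a META tag."""
    ordered = sorted(weighted_tokens, key=weighted_tokens.get, reverse=True)
    return '<meta NAME="nyms" CONTENT="{}">'.format(" ".join(ordered))


def embed(html, meta_tag):
    """Insert the tag right after <head>; prepend it if no <head> is found."""
    marker = "<head>"
    pos = html.lower().find(marker)
    if pos == -1:
        # Assumed fallback: documents without a <head> get the tag up front.
        return meta_tag + "\n" + html
    cut = pos + len(marker)
    return html[:cut] + "\n" + meta_tag + html[cut:]


# Tokens and normalized weights taken from the exemplary document in the text.
tag = build_meta_tag({"Q1A5": 1000, "QABC": 412, "QX2H": 618})
page = embed("<html><head><title>t</title></head></html>", tag)
```

The search engine then indexes the page as usual, so the concept tokens end up in its index next to the ordinary words.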

16 Claims, 5 Drawing Sheets 
METHOD AND APPARATUS FOR CONCEPT SEARCHING USING A BOOLEAN OR KEYWORD SEARCH ENGINE

TECHNICAL FIELD

This invention generally relates to database search engines for computer systems. More particularly, this invention relates to concept searching using a Boolean or keyword search engine.

BACKGROUND OF THE INVENTION

Database search engines permit users to perform queries on a set of documents by submitting search terms. Users must typically submit one or more search terms to the search engine in a format specified by the search engine. Most search engines specify that search terms should be submitted as a Boolean or keyword search query (i.e. "red OR green" or "blue AND black"). Boolean or keyword search queries can become extremely complex as the user adds more search terms and Boolean operators. Moreover, most search engines have complex syntax rules regarding how a Boolean or keyword search query must be constructed. For users to get accurate search results, therefore, they must remember the appropriate syntax rules and apply them in an effective manner. This process can be difficult for many users and, unless mastered, may result in searches which return irrelevant documents.

"Natural language" search engines have been developed which permit users to submit a natural language query to the search engine rather than just keywords. For instance, a user may input the simple natural language sentence "How do I fix my car?" instead of the more complex Boolean search query "how AND to AND fix AND car?" Instead of searching for just the keywords contained in the search query, a typical natural language search engine will extract the concepts implied by the query and search the database for documents referencing the concepts. A natural language search engine will therefore return documents from its database which contain the concepts contained in the search query even if the documents do not contain the exact words in the search query. A natural language search query may be submitted to a Boolean or keyword search engine. However, these types of search engines will only return documents containing the exact words in the search query.

Although natural language search engines provide the benefits of easy to understand natural language search queries and concept searching, natural language search engines are not without their drawbacks. For example, natural language search engines are considerably more expensive to develop than a Boolean or keyword search engine. Moreover, natural language search engines can be difficult and expensive to implement, especially where they are used to replace existing Boolean or keyword search engines.

Therefore, there is a need for a method and apparatus for database searching (1) which permits effective searching using a Boolean or keyword search engine with natural language search queries, (2) which permits concept searching using a Boolean or keyword search engine, and (3) which may be implemented without any modification to the Boolean or keyword search engine.

SUMMARY OF THE PRESENT INVENTION

The present invention satisfies the above-described needs by providing a method and apparatus for concept searching using a Boolean or keyword search engine. Using the method and apparatus of the exemplary embodiment, documents are preprocessed before being passed to the search engine for inclusion in the search engine's database. Search queries are also preprocessed before being passed to the search engine.

With regard to the preprocessing of documents, each document is scanned on a word-by-word basis to identify the "word tokens" contained in the document. Word tokens are actual words or word-like strings such as dates, numbers, etc. Once the word tokens in a document have been extracted, each word token is located in a "concept database" which maps word tokens to concept identifiers. Each word token may map to zero or more concept identifiers. Once the concept identifiers associated with each word token have been extracted from the concept database, a list of the concept identifiers is created. Each of the concept identifiers in the list is then converted into a unique non-word concept token which identifies the concept. A concept token is a non-word character string which identifies and is mapped to a concept. For instance, the concept token [illegible] may map to the concept of color. The concept tokens are then arranged into a list.

Once the list of concept tokens has been created, the tokens are inserted into the document. In an exemplary embodiment, a hypertext markup language ("HTML") META tag is used to insert the concept tokens into the document. Using the HTML META tag, the concept tokens [illegible] engine and therefore [illegible] documents indexed by the search engine.

With regard to the preprocessing of search queries, an additional component is interposed between the query submitted by the user and the search engine. This component processes the query in much the same way as document preprocessing described above, and then sends a modified query to the search engine.

Queries are preprocessed by first breaking the search terms into word tokens. The word tokens are then referenced in the concept database (the same database used for document preprocessing) and any associated concept identifiers are retrieved. The concept identifiers are then converted to unique concept tokens as described above and are combined [illegible] string [illegible] preprocessed query, which is then sent to the search engine.

The unmodified Boolean or keyword search engine then [illegible] of the documents whose concept tokens most closely match the concept tokens in the modified query. The preprocessing of both documents and queries is transparent to the search engine. However, the exemplary embodiment of the present invention described herein solves all of the above-described problems by modifying the built-in functionality of a Boolean or keyword search engine to permit searching for concepts rather than keywords.

It is an object of the present invention to provide a method and apparatus for database searching which permits effective searching using a Boolean or keyword search engine with natural language search queries.

It is also an object of the present invention to provide a method and apparatus for database searching which permits concept searching using a Boolean or keyword search engine.
It is a further object of the present invention to provide a method and apparatus for natural language and concept searching using a Boolean or keyword search engine which may be implemented without any modification to the Boolean or keyword search engine.

That the present invention and the exemplary embodiments thereof overcome the problems and drawbacks set forth above and accomplish the objects of the invention set forth herein will become apparent from the detailed description of exemplary embodiments which follows.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram of a networked personal computer that provides the operating environment for an embodiment of the present invention.

FIG. 2 is a flow diagram illustrating steps for the preprocessing of documents.

FIG. 3 is a flow diagram illustrating steps for the preprocessing of database queries.

FIG. 4 is a diagram illustrating the preprocessing of an exemplary document.

FIG. 5 is a diagram illustrating the preprocessing of an exemplary database query.

DETAILED DESCRIPTION OF AN EXEMPLARY EMBODIMENT

In an exemplary embodiment of the present invention, an application program is interposed between a user and a Boolean or keyword search engine which preprocesses documents prior to submission to the search engine's database and also preprocesses search queries prior to submission to the search engine. In this manner, a Boolean or keyword search engine may be searched for concepts.

Exemplary Operating Environment

FIG. 1 and the following discussion are intended to provide a brief, general description of a suitable computing environment in which the invention may be implemented. While the invention will be described in the general context of an application program that runs on an operating system in conjunction with a personal computer, those skilled in the art will recognize that the invention also may be implemented in combination with other program modules. Generally, program modules include routines, programs, components, data structures, etc. that perform particular tasks or implement particular abstract data types. Moreover, those skilled in the art will appreciate that the invention may be practiced with other computer system configurations, including hand-held devices, multiprocessor systems, microprocessor-based or programmable consumer electronics, minicomputers, mainframe computers, and the like. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote memory storage devices.

With reference to FIG. 1, an exemplary system for implementing the invention includes a conventional personal computer 20, including a processing unit 21, a system memory 22, and a system bus 23 that couples the system memory to the processing unit 21. The system memory 22 includes read only memory (ROM) 24 and random access memory (RAM) 25. A basic input/output system 26 (BIOS), containing the basic routines that help to transfer information between elements within the personal computer 20, such as during start-up, is stored in ROM 24. The personal computer 20 further includes a hard disk drive 27, a magnetic disk drive 28, e.g., to read from or write to a removable disk 29, and an optical disk drive 30, e.g., for reading a CD-ROM disk 31 or to read from or write to other optical media. The hard disk drive 27, magnetic disk drive 28, and optical disk drive 30 are connected to the system bus 23 by a hard disk drive interface 32, a magnetic disk drive interface 33, and an optical drive interface 34, respectively. The drives and their associated computer-readable media provide nonvolatile storage for the personal computer 20. Although the description of computer-readable media above refers to a hard disk, a removable magnetic disk and a CD-ROM disk, it should be appreciated by those skilled in the art that other types of media which are readable by a computer, such as magnetic cassettes, flash memory cards, digital video disks, Bernoulli cartridges, and the like, may also be used in the exemplary operating environment.

A number of program modules may be stored in the drives and RAM 25, including an operating system 35, one or more application programs 36, other program modules 37, and program data 38. A user may enter commands and information into the personal computer 20 through a keyboard 40 and pointing device, such as a mouse 42. Other input devices (not shown) may include a microphone, joystick, game pad, satellite dish, scanner, or the like. These and other input devices are often connected to the processing unit 21 through a serial port interface 46 that is coupled to the system bus, but may be connected by other interfaces, such as a game port or a universal serial bus (USB). A monitor 47 or other type of display device is also connected to the system bus 23 via an interface, such as a video adapter 48. In addition to the monitor, personal computers typically include other peripheral output devices (not shown), such as speakers or printers.

The personal computer 20 may operate in a networked environment using logical connections to one or more remote computers, such as a remote computer 49. The remote computer 49 may be a server, a router, a peer device or other common network node, and typically includes many or all of the elements described relative to the personal computer 20, although only a memory storage device 50 has been illustrated in FIG. 1. The logical connections depicted in FIG. 1 include a local area network (LAN) 51 and a wide area network (WAN) 52. Such networking environments are commonplace in offices, enterprise-wide computer networks, intranets and the Internet.

When used in a LAN networking environment, the personal computer 20 is connected to the LAN 51 through a network interface 53. When used in a WAN networking environment, the personal computer 20 typically includes a modem 54 or other means for establishing communications over the WAN 52, such as the Internet. The modem 54, which may be internal or external, is connected to the system bus 23 via the serial port interface 46. In a networked environment, program modules depicted relative to the personal computer 20, or portions thereof, may be stored in the remote memory storage device. It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers may be used.

As discussed earlier, the exemplary embodiments of the present invention are embodied in application programs run by an operating system 35. The operating system 35 generally controls the operation of the previously discussed personal computer 20, including input/output operations. In the exemplary operating environment, the invention is used
in conjunction with Microsoft Corporation's "WINDOWS NT" and "WINDOWS 95" operating systems. However, it should be understood that the invention can be implemented for use in other operating systems, such as Microsoft Corporation's "WINDOWS 3.1" and "WINDOWS 95" operating systems, IBM Corporation's "OS/2" and "AIX" operating systems, SunSoft's "SOLARIS" operating system used in workstations manufactured by Sun Microsystems, Hewlett-Packard's "HP-UX" and "RT-UX" operating systems, and the operating systems used in "MACINTOSH" computers manufactured by Apple Computer, Inc.

With the above preface on the exemplary operating environment for embodiments of the present invention, the remaining figures illustrate aspects of several embodiments of the present invention. In FIG. 2, a flow diagram is illustrated showing the steps for the preprocessing of documents. FIG. 3 is a flow diagram illustrating the steps for preprocessing database queries. In FIG. 4, the operation of the method and apparatus of an exemplary embodiment of the present invention is shown using an exemplary document. In FIG. 5 the operation of the method and apparatus of an embodiment of the present invention is illustrated using an exemplary database query.

Operation of a Typical Boolean or Keyword Search Engine

The present invention modifies the built-in functionality of a typical Boolean or keyword search engine to permit searching for concepts. Therefore, in order to understand the operation of the present invention, it is helpful to understand the operation of a typical Boolean or keyword search engine. Many Boolean or keyword search engines function by applying the following method to each searchable document. First the document file is read, and the plain text is extracted. Any non-text information and special formatting codes are ignored. The plain text is then broken into strings delimited by spaces and punctuation characters, to produce a series of word tokens. Word tokens can be actual words, or word-like strings such as dates, numbers, etc. An "inverted index" is then built for the document file. For a given word token, this index can return the list of all searchable documents containing that word token.

When a search query is submitted to the search engine, a similar process extracts the word tokens from the search query. The inverted index is then searched to find documents which best match the query at a word token level. The closeness of match is most commonly based on whether the document satisfies a Boolean expression made up of the query terms, or on a weighted aggregate of the terms in both the query and the document such as the well-known "Vector Space Model" (see e.g. "Automatic Text Processing", G. Salton [Addison-Wesley, 1989], section 10.1.1). The present invention modifies the above-described functionality of a typical Boolean or keyword search engine to permit searching for concepts rather than mere word tokens.

The Methods and Apparatus of the Disclosed Embodiments

The disclosed embodiment for concept searching using a Boolean or keyword search engine comprises two separate methods. In an exemplary embodiment, these methods are embodied in application program software modules. The first of these two methods preprocesses documents prior to submission to the search engine for inclusion in the search engine's database. The second of these two methods preprocesses database queries prior to submission to the search engine. These methods are described in detail below.

Document Preprocessing

The first method of the disclosed embodiment preprocesses documents prior to submission to the search engine for inclusion in its database of searchable documents. Referring now to FIGS. 1 and 2, the method 200 for preprocessing documents begins at step 210 where a document file is read. At step 215, the first word token in the document file is read. As indicated above, word tokens may be actual words, word-like strings such as dates and numbers, or any other combination of characters.

From step 215, the method continues to step 220, where the word token is looked up in the concept database 63. The concept database 63 is a database which maps word tokens to concepts. In particular, each word token contained in the concept database 63 may map to zero or more concept identifiers. For instance, the word token "red" may map to the concepts "color," "hue," and "shade." In an embodiment, each word token may also have an associated numerical "weight" which describes how strongly the word token implies the concept represented by the associated concept identifier. The concept database 63 is created manually.

From step 220, the method continues to decision step 225, where a determination is made as to whether the word token is contained in the concept database 63. If the word token is not contained in the concept database 63, the "NO" branch is taken to step 240. If the word token is contained in the concept database 63, the "YES" branch is followed to step 230, where the concept identifiers associated with the word token, if any, are read from the concept database 63. Also read are the numerical weights, if any, associated with the word token.

From step 230, the method continues to step 235 where the word token weight is summed with the sum of the word token weights for any previous word tokens in the document file which had the same concept identifier. In this manner the sum of all of the numerical weights for word tokens which have the same concept identifier is created. As discussed in more detail below, this number indicates how strongly a concept is described in a document file and is used to order word tokens according to "strength."

From step 235, the method continues to decision step 240, where a determination is made as to whether there are more word tokens in the document file. If there are more word tokens contained in the document file, the "YES" branch is taken to step 245, where the next word token in the document file is read. If there are no more word tokens contained in the document file, the "NO" branch is taken to step 250, where the sums of the word token weights for all concept identifiers are normalized so that the concept identifier with the highest weight equals 1000. For instance, if three concept identifiers have sums of word token weights of 135, 256, and 350, after normalization, their normalized weights would be 386, 731, and 1000, respectively.

From step 250, the method continues to step 255, where the concept identifiers are assigned to concept "tokens." Concept tokens are non-word strings of characters which uniquely identify the concepts. In an embodiment, each concept token is a string of characters consisting of an uppercase 'Q' followed by three characters which are either the numerical digits (0-9) or upper-case letters (A-Z). Specifically, concept tokens are created in this manner by converting the concept identifiers to base 36 (26 letters of the alphabet plus 10 digits) and then mapping the base 36 digits (0-35) to the characters A-Z, 0-9. An uppercase 'Q' is then prepended to the string. In this manner, each concept identifier is assigned a unique non-word concept token such as 'QABC,' 'Q1A5,' or 'QX2H.' Other methods for creating unique concept tokens will be appreciated by those skilled in the art, the only requirement being that the search engine must recognize such tokens as individual words and include them in the inverted index.
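
Steps 235 through 255 can be illustrated with a minimal Python sketch. It assumes that concept identifiers are non-negative integers below 36^3 and that short values are left-padded with 'A' (the digit zero) to three characters; the text states neither detail. The function names `accumulate_weights`, `normalize_weights` and `concept_token` are illustrative and do not come from the patent.

```python
from collections import defaultdict

# Base-36 digit alphabet in the order the text gives: 0-25 -> A-Z, 26-35 -> 0-9.
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"


def accumulate_weights(hits):
    """Step 235: sum word token weights per concept identifier.

    `hits` is an iterable of (concept_id, weight) pairs, one per database hit.
    """
    sums = defaultdict(int)
    for concept_id, weight in hits:
        sums[concept_id] += weight
    return dict(sums)


def normalize_weights(sums):
    """Step 250: scale so the largest sum becomes 1000, rounding to nearest."""
    top = max(sums.values(), default=0)
    if top == 0:
        # Nothing to scale against; leave every weight at zero.
        return {cid: 0 for cid in sums}
    return {cid: round(1000 * s / top) for cid, s in sums.items()}


def concept_token(concept_id):
    """Step 255: 'Q' followed by three base-36 digits (fixed width is assumed)."""
    if not 0 <= concept_id < 36 ** 3:
        raise ValueError("identifier does not fit in three base-36 digits")
    chars = []
    for _ in range(3):
        concept_id, digit = divmod(concept_id, 36)
        chars.append(ALPHABET[digit])
    return "Q" + "".join(reversed(chars))
```

With the sums 135, 256 and 350 from the text, `normalize_weights` yields 386, 731 and 1000. Under the padding assumption, identifier 38 encodes as QABC and identifier 35023 as Q1A5.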
From step 255, the method continues to step 260, where the concept tokens are arranged in order of their associated normalized weights. From step 260, the method continues to step 265, where the concept tokens are embedded into the document file in their arranged order. In an exemplary embodiment, a hypertext markup language ("HTML") META tag is inserted into the document to embed the concept tokens. Using an HTML META tag, the concept tokens are treated as ordinary text by the search engine and may be searched, but are not displayed and are therefore invisible to the user. Specifically, the NAME portion of the HTML META tag is the arbitrary string "nyms" and the CONTENT portion is a space-separated list of concept tokens that encodes the concepts found in the document. The arbitrary string "nyms" will later be used to instruct the search engine to search the CONTENT portion of the META tag for concept tokens. This is described below. An example of a typical META tag containing concept tokens is:

    <meta NAME="nyms" CONTENT="QABC Q1A5 QX2H">

Other means for storing invisible text in a document file will be appreciated by those skilled in the art.

From step 265, the method continues to step 270, where the document file with encoded concept tokens is passed to the search engine for normal indexing and inclusion in the search engine's database. The method 200 ends at step 280.

Query Preprocessing

Once all documents have been preprocessed as described above in connection with the method 200, the method 300 for preprocessing search queries may begin. As discussed above, the method 300 for preprocessing queries is embodied in an application program software module interposed between the user and the search engine. The operation of this software module is transparent to the user.

Referring now to FIGS. 1 and 3, the method 300 begins at step 310 where the search query input by the user is read. At step 315 the first word token contained in the search query is read. As described above, word tokens may be words, word-like strings, numbers, etc. At step 320, the word token is looked up in the concept database 63. The same concept database 63 described above for the preprocessing of documents is also used for the preprocessing of search queries.

From step 320, the method continues to decision step 325, where a determination is made as to whether the word token is contained in the concept database 63. If the word token is not contained in the concept database 63, the "NO" branch is taken to step 340. If the word token is contained in the concept database 63, the "YES" branch is followed to step 330 where the concept identifiers, if any, associated with the word token are read from the concept database 63. From step 330, the method continues to step 335, where the concept identifiers are converted into unique non-word concept tokens. This process is the same as the process described above in connection with the preprocessing of documents.

From step 335, the method continues to decision step 340, where a determination is made as to whether there are more word tokens contained in the search query. If there are more word tokens contained in the search query, the "YES" branch is taken to step 345 where the next word token in the search query is read. If there are no more word tokens contained in the search query, the "NO" branch is taken to step 350, where the concept tokens are weighted according to the number of word tokens in the search query which referenced concept identifiers associated with the concept token. This weighting is accomplished by normalizing the number of occurrences of the concept token, with the largest number of occurrences equal to 1000. For instance, if three concept tokens were referenced 5, 8, and 11 times in a search query, the normalized weights of the three concept tokens would be 455, 727, and 1000, respectively. In this manner, concepts which are referenced more frequently in a search query are given a higher weight.

From step 350, the method continues to step 355, where the concept tokens are ordered according to their normalized weights. In an exemplary embodiment, concept tokens with normalized weights less than a threshold value may be truncated to prevent searching for weak concepts.

From step 355, the method continues to step 360, where the concept tokens and their associated normalized weights are passed to the search engine. Also passed along with the concept tokens are instructions to the search engine to search only the "nyms" portion of the HTML META tag described above for the concept tokens. In this manner, only the CONTENT portion of the HTML META tag is searched. Therefore, the search engine matches concepts identified in the document and embedded in the META tag with concepts identified in the query. The method 300 ends at step 370. In the exemplary embodiment, the concept tokens and weights are passed to the search engine as a "vector query," that is, a query using the "Vector Space Model" described above. Another embodiment could also pass the tokens in the form of a Boolean AND or OR query, or in any other form supported by the particular search engine being used.

Preprocessing an Exemplary Document

FIG. 4 illustrates the operation of the exemplary embodiment for preprocessing a document using an exemplary document. Referring now to FIGS. 1, 2 and 4, an exemplary document 405 contains text 406 and is to be preprocessed prior to submission to search engine 62 for inclusion in the document database 64. In an embodiment, document 405 will be stored in RAM 25 or on hard disk 27 prior to submission to search engine 62. After the document 405 has been preprocessed, it will be stored on remote computer 49 in document database 64.

In an exemplary embodiment, the method 200 for document preprocessing is embodied in a document preprocessing application program 60 which runs on remote computer 49. However, those skilled in the art will understand that the document preprocessing application program 60 may be run on personal computer 20 or on another computer system connected via local area network 51 or wide area network 52.

Document 405 contains exemplary text 406 which reads: "The appearance of a font may be changed by modifying its weight and color. A red, green, or blue font is attractive." Preprocessing of document 405 will now be described with reference to FIG. 2 and method 200. The method 200 for preprocessing exemplary document 405 begins at step 210 by reading the document 405. At step 215, the first word token in document 405 is read. Because each of the words in text 406 constitutes a word token, the first word token is "The."

From step 215, the method continues to step 220 where the concept database 63 is consulted to determine if it contains the word token "The." Concept database 63 contains word tokens 407 which map to zero or more concept identifiers 408. Because concept database 63 does not contain the word token "The," the "NO" branch is taken from step 225 to decision step 240, where a determination is made as to whether the document 405 contains more word tokens. Because document 405 does contain additional word tokens, the "YES" branch is followed to step 245 where the next word token, "appearance," is read from document 405.
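
The first iterations of this walk-through can be sketched in Python. The regular expression is one plausible reading of "strings delimited by spaces and punctuation characters". The dictionary `CONCEPT_DB` is a hypothetical stand-in for concept database 63 and holds only the single entry the text spells out ("appearance" with weight 10). Lookups are exact, which is why "The" misses.

```python
import re

# Hypothetical stand-in for concept database 63: word token -> [(concept, weight)].
CONCEPT_DB = {
    "appearance": [("appearance", 10)],
}

TEXT_406 = ("The appearance of a font may be changed by modifying its "
            "weight and color. A red, green, or blue font is attractive.")


def word_tokens(text):
    """Split text into word tokens on spaces and punctuation (assumed rule)."""
    return re.findall(r"[A-Za-z0-9]+", text)


def lookup(tokens, db):
    """Steps 220-245: yield (token, entries) for tokens found in the database."""
    for token in tokens:
        entries = db.get(token)   # step 225: is the token in the database?
        if entries is None:
            continue              # "NO" branch: move on to the next token
        yield token, entries      # "YES" branch: step 230 reads the entries


tokens = word_tokens(TEXT_406)
hits = list(lookup(tokens, CONCEPT_DB))
```

Here `tokens` starts with "The" and "appearance", and the only hit against this one-entry database is "appearance". A fuller database would also return entries for words such as "font" and "color".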
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At step 220, the concept database 63 is again consulted to 
determine if it contains the word token "appearance." 
Because the concept database 63 does contain the word 
token "appearance," the "YES" branch is followed to step 
230, where the concept identifier 408 associated with the 
word token "appearance" is read. The weight associated 
with the word token 408 is also read from the concept 
database 63. The word token "appearance" is associated 
with the concept identifier "appearance." Because the word 
token 408 describes the concept identifier so strongly (they 
are identical), the numerical weight associated with the word 
token is 10 (out of a possible 10). 

From step 230, the method continues to step 235 where 
the word token weight (10) is added to the sum of word 
token weights for previous word tokens with the same 
concept identifier ("appearance"). Because there are no 
previous word token weights for document 405, there is 
nothing to add and the method 200 continues at step 240. 
Steps 225, 230, 235, 240, 245 and 220 are repeated in the 
above-described manner until there are no more word tokens 
in document 405. The method 200 then continues at step 
250, where the sums of the word token weights for each of 
the concept identifiers 408 are normalized to 1000. In the 
exemplary document 405, three concept identifiers 408 are 
referenced: "color," "font," and "appearance." The sums of 
the word token weights for these three concept identifiers are 
34, 14, and 21, respectively. Therefore, the normalized 
weights are 1000, 412 and 618, respectively. 

From step 250, the method continues to step 255 where 
each of the concept identifiers, "color," "font," and 
"appearance," are converted to unique non-word concept 
tokens 409 Q1A5, QABC, and QX2H, respectively. This 
process is described in detail above. At step 260, concept 
tokens 409 are arranged according to their associated 
normalized weights. The concept token for "color" (Q1A5) is 
placed first in the list because it has the highest normalized 
weight (1000) and is followed by the concept token for 
"appearance" (QX2H) and then the concept token for "font" 
(QABC). 

From step 260 the method continues to step 265 where 
the concept tokens 409 are embedded into document 405 using 
HTML META tag 411 to create preprocessed document 410. 
In the exemplary embodiment, multiple occurrences of 
concept tokens may be inserted for concepts with high 
normalized weights. For instance, because "color" had 
the highest normalized weight in document 405, multiple 
instances of the concept token Q1A5 may be placed in the 
META tag. 

From step 265, the method continues to step 270 where 
preprocessed document 410 with concept tokens 408 
inserted is passed to search engine 62. Search engine 62 then 
adds preprocessed document 410 to the document database 
64 as it normally would. The preprocessing of document 405 
is completely invisible to search engine 62. The method 200 
ends at step 280. 

Preprocessing an Exemplary Search Query 

FIG. 5 illustrates preprocessing a user search query 
using an exemplary query. Referring now to FIGS. 1, 3, and 
5, an exemplary search query 505 would typically be typed 
on keyboard 40 by a user for transmission to remote 
computer 49 using a browser application program 36. The query 
preprocessor application program 61 would intercept the 
search query 505 and preprocess it prior to submission to the 
search engine program 62. The operation of the query 
preprocessor application program 61 would be invisible to 
both the user and to the search engine 62. 

The method 300 for preprocessing a search query begins 
at step 310 where the exemplary search query 505 is read. 
The exemplary search query 505 contains text 506 which 
reads: "Can I modify the color of a font to make it look more 
attractive?" 

From step 310, the method continues to step 315, where 
the first word token, "can," in search query 505 is read. At 
step 320, the word token "can" is looked up in the concept 
database 63. At decision step 325, a determination is made 
as to whether the word token "can" is contained in concept 
database 63. Because concept database 63 does not contain 
the word token "can," the "NO" branch is taken to step 340 
where a determination is made as to whether the search 
query contains additional word tokens. Because exemplary 
search query 505 contains additional word tokens, the 
"YES" branch is taken to step 345 where the next word 
token, "I," is read. 

Steps 320, 325, 340, and 345 of method 300 are repeated 
until a word token in exemplary search query 505 is 
encountered which is contained in concept database 63. The first 
such word token is "color," which will be read at step 345 
of method 300. The method 300 then continues to step 320, 
where the word token "color" is looked up in concept 
database 63. At decision step 325, a determination is made 
as to whether word token "color" is contained in concept 
database 63. Because "color" is contained in concept 
database 63, the "YES" branch is taken to step 330, where the 
concept identifiers 408 associated with the word token 
"color" are read from the concept database 63. The only 
concept identifier associated with the word token "color" is 
the concept identifier "color." At step 335, the concept 
identifier "color" is converted into a unique non-word 
concept token using the procedure described above. The word 
token "color," for instance, will be converted to the concept 
token Q1A5. 

From step 335, the method continues to decision step 340, 
where a determination is made as to whether there are more 
word tokens in the search query 505. Because there are more 
word tokens, the above procedure repeats until there are no 
more word tokens contained in the search query 505. 

When there are no more word tokens contained in search 
query 505, the method 300 branches to step 350, where 
concept tokens 409 are assigned a normalized weight 
according to the number of times which they were 
referenced in search query 505, with the concept token with the 
most occurrences being assigned 1000. Because the concept 
token QX2H ("appearance") was referenced twice (word 
tokens "look" and "attractive"), it is given the normalized 
weight 1000. The other two concept tokens (Q1A5 and 
QABC) are each assigned a normalized weight of 500 
because they were each only referenced one time. 
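
The query-side weighting just described can be illustrated with a minimal Python sketch. It is not the patent's implementation: the dictionary `CONCEPT_DB` is a hypothetical stand-in for concept database 63 that maps word tokens directly to concept tokens (collapsing the intermediate concept identifiers), and it holds only the four entries that the exemplary query implies. The tokenizing rule and the function name are likewise assumptions.

```python
import re
from collections import Counter

# Hypothetical stand-in for concept database 63: word token -> concept tokens.
# Only the entries implied by the exemplary query are included.
CONCEPT_DB = {
    "color": ["Q1A5"],
    "font": ["QABC"],
    "look": ["QX2H"],
    "attractive": ["QX2H"],
}


def query_concept_weights(query):
    """Return (concept token, normalized weight) pairs, highest weight first."""
    # Assumed tokenizer: lowercase alphanumeric runs serve as word tokens.
    word_tokens = re.findall(r"[a-z0-9]+", query.lower())

    # Count how often each concept token is referenced; word tokens that are
    # not in the database (such as "can") contribute nothing.
    counts = Counter(
        concept for word in word_tokens for concept in CONCEPT_DB.get(word, [])
    )
    if not counts:
        return []

    # The most frequently referenced concept token is assigned 1000 and the
    # others are scaled in proportion to their counts.
    top = max(counts.values())
    weighted = [(tok, round(1000 * n / top)) for tok, n in counts.items()]
    return sorted(weighted, key=lambda pair: (-pair[1], pair[0]))


print(query_concept_weights(
    "Can I modify the color of a font to make it look more attractive?"
))
```

With the exemplary query, QX2H is counted twice and the other two tokens once, which reproduces the weights 1000, 500 and 500 given in the text.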

From step 350, the method continues to step 355, where 
the concept tokens 409 are ordered into a list according to 
their assigned normalized weights. The normalized weights 
are also included in the list along with the concept tokens. 
Text is prepended to the list to instruct the search engine to 
search the "nyms" portion of the META tag for the concept 
tokens. An exemplary string may look like: 

<Search META "nyms" for QX2H (1000) Q1A5 (500) QABC 
(500)> 
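
As a rough illustration of step 355, the following Python sketch assembles a string in the shape of the exemplary one from an already ordered list of concept tokens and weights. The function name is hypothetical, and the `<Search META "nyms" for ...>` wrapper is copied from the example only; a real Boolean or keyword search engine would need its own query syntax.

```python
def build_query_string(weighted_tokens):
    """Format ordered (concept token, weight) pairs as the exemplary string.

    The wrapper text mirrors the patent's example; it is not the syntax of
    any particular search engine.
    """
    parts = " ".join(f"{token} ({weight})" for token, weight in weighted_tokens)
    return f'<Search META "nyms" for {parts}>'


# Ordered list for the exemplary query: QX2H first, then Q1A5 and QABC.
print(build_query_string([("QX2H", 1000), ("Q1A5", 500), ("QABC", 500)]))
```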




From step 355, the method continues to step 360, where 
the string including the concept tokens 409 and their 
normalized weights is passed to the search engine as a normal 
search query. The method 300 ends at step 370. 
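
For comparison with the query-side steps, the document-side weighting of method 200 (steps 250 through 265) can be sketched in Python as well. This is a sketch under stated assumptions, not the patent's implementation: the function names are hypothetical, the rounding rule is inferred from the example figures (34, 14 and 21 becoming 1000, 412 and 618), and the exact layout of the META tag is an assumption beyond the text's mention of a "nyms" portion.

```python
from html import escape


def normalize_sums(sums, scale=1000):
    """Scale summed word token weights so that the largest becomes `scale`."""
    if not sums:
        return []
    top = max(sums.values())
    ranked = [(token, round(scale * s / top)) for token, s in sums.items()]
    # Highest normalized weight first, as in step 260.
    return sorted(ranked, key=lambda pair: -pair[1])


def meta_tag(ranked):
    """Build an assumed META tag holding the concept tokens in rank order."""
    content = " ".join(token for token, _ in ranked)
    return f'<META NAME="nyms" CONTENT="{escape(content)}">'


# Sums from the exemplary document 405, keyed by concept token:
# "color" -> Q1A5, "font" -> QABC, "appearance" -> QX2H.
ranked = normalize_sums({"Q1A5": 34, "QABC": 14, "QX2H": 21})
print(ranked)
print(meta_tag(ranked))
```

Dividing each sum by the largest sum and rounding gives 1000, 618 and 412, which matches the ordering Q1A5, QX2H, QABC described for step 260.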

In view of the foregoing, it will be appreciated that the 
present invention provides a method and apparatus for 
concept searching using a Boolean or keyword search 
engine. It should be understood that the foregoing relates 
only to specific embodiments of the present invention, and 
that numerous changes may be made therein without 
departing from the spirit and scope of the invention as defined by 
the following claims. 
What is claimed is: 

1. A computer-readable medium on which is stored a 
computer program for preprocessing a document comprising 
one or more word tokens, the computer program comprising 
instructions which, when executed by a computer, perform 
the steps of: 

determining whether one of the word tokens in the 
document is contained in a concept database; 

in response to determining that one of the word tokens is 
contained in the concept database, reading a plurality of 
concept identifiers associated with the word token from 
the concept database; and 

in response to reading the concept identifier, assigning the 
concept identifiers to unique non-word concept tokens, 
and embedding the concept tokens in the document for 
use by a search engine not otherwise capable of concept 
searching. 

2. The computer-readable medium of claim 1, further 
comprising the following steps after the assigning step: 

determining whether the document contains additional 
word tokens; and 

in response to determining that the document contains 
additional word tokens, incrementing to the next word 
token contained in said document and repeating from 
the first determining step. 

3. A computer-readable medium on which is stored a 
computer program for preprocessing a document comprising 
one or more word tokens, the computer program comprising 
instructions which, when executed by a computer, perform 
the steps of: 

determining whether one of the word tokens is contained 
in a concept database; 

in response to determining that the word token is con- 
tained in the concept database, reading a plurality of 
concept identifiers associated with the word token from 
the concept database, and reading a numerical weight 
associated with the word token from the concept data- 
base; 

in response to reading the concept identifiers and weights, 
adding the numerical weights to the sum of any numeri- 
cal weights for previous word tokens associated with 
the concept identifiers to create a sum of word token 
weights for each of the plurality of concept identifiers; 
in response to adding the weights, determining whether 
the document contains additional word tokens; 

in response to determining that the document contains 
additional word tokens, incrementing to the next word 
token contained in said document and repeating from 
the first determining step; and 

in response to determining that the document does not 
contain additional word tokens, normalizing the sums 
of word token weights for each of the plurality of 
concept identifiers, arranging each of the plurality of 
concept identifiers according to the value of said 
normalized sums of word token weights, converting each 
of the plurality of concept identifiers to unique concept 
tokens, and embedding the concept tokens in the 
document. 

4. A computer-readable medium on which is stored a 
computer program for preprocessing a query comprising one 
or more word tokens, the computer program comprising 
instructions which, when executed by a computer, perform 
the steps of: 

determining whether one of the word tokens in the query 
is contained in a concept database; 

in response to determining that the word token is 
contained in the concept database, reading concept 
identifiers associated with the word token from the concept 
database; and 

in response to reading concept identifiers, assigning the 
concept identifiers to unique non-word concept tokens 
and passing the concept identifiers to a search engine 
not otherwise capable of concept searching as search 
parameters. 

5. The computer-readable medium of claim 4, further 
comprising the following steps after the reading step and 
before the assigning step: 

determining whether the query contains additional word 
tokens; and 

in response to determining that the query contains 
additional word tokens, selecting the next word token 
contained in the query and repeating from the first 
determining step. 

6. A computer-readable medium on which is stored a 
computer program for preprocessing a query comprising one 
or more word tokens, the computer program comprising 
instructions which, when executed by a computer, perform 
the steps of: 

determining whether one of the word tokens in the query 
is contained in a concept database; 

in response to determining that the word token is 
contained in the concept database, reading concept 
identifiers associated with the word token from the concept 
database; 

in response to reading concept identifiers, assigning the 
concept identifiers to unique concept tokens, and 
determining whether the query contains additional word 
tokens; 

in response to determining that the query contains 
additional word tokens, selecting the next word token 
contained in the query and repeating from the first 
determining step; and 

in response to determining that the query does not contain 
additional word tokens, assigning each concept token a 
normalized weight based upon the number of 
occurrences of each of the concept tokens, arranging each of 
the concept tokens according to the value of the 
normalized weights associated with said concept tokens, 
and passing the concept tokens and normalized weights 
to the search engine. 

7. The computer-readable medium of claim 6, wherein the 
arranging step further comprises removing concept tokens 
whose normalized weights are less than a threshold value. 

8. A method for preprocessing a document comprising 
one or more word tokens, the method comprising the steps 
of: 

determining whether one of the word tokens in the 
document is contained in a concept database; and 

in response to determining that the word token is con- 
tained in the concept database, reading concept iden- 
tifiers associated with the word token from the concept 
database, converting the concept identifiers to unique 
non-word concept tokens, and embedding the concept 
tokens in the document for use by a search engine not 
otherwise capable of concept searching. 

9. The method of claim 8, further comprising the follow- 
ing steps after the embedding step: 

determining whether the document contains additional 
word tokens; and 

in response to determining that the document contains 
additional word tokens, selecting the next word token 
in the document and repeating from the first 
determining step. 

10. A method for preprocessing a document comprising 
one or more word tokens, the method comprising the steps 
of: 

determining whether one of the word tokens in the 
document is contained in a concept database; 

in response to determining one of the word tokens is 
contained in the concept database, reading concept 
identifiers associated with the word token from the 
concept database, and reading a numerical weight asso- 
ciated with the word token from the concept database; 

in response to reading concept identifiers and a numerical 
weight, adding the numerical weight to the sum of any 
numerical weights for any previous word tokens asso- 
ciated with the plurality of concept identifiers to create 
a sum of word token weights for each of said plurality 
of concept identifiers and determining whether said 
document contains additional word tokens; 

in response to determining that the document contains 
additional word tokens, selecting the next word token 
contained in the document and repeating from the 
determining step; and 

in response to determining that the document does not 
contain additional word tokens, normalizing the sums 
of word token weights for each of the concept 
identifiers, arranging each of the concept identifiers 
according to the value of the normalized sums of word 
token weights, converting each of the concept identi- 
fiers to unique concept tokens, and embedding the 
concept tokens in the document. 

11. A method for preprocessing a query comprising one or 
more word tokens, the method comprising the steps of: 

determining whether one of the word tokens in the query 
is contained in a concept database; 

in response to determining that the word token is 
contained in the concept database, reading concept 
identifiers associated with said word token from said 
concept database; and 

in response to reading concept identifiers, assigning the 
concept identifiers to unique non-word concept tokens 
and passing the concept identifiers to the search engine 
for use by a search engine not otherwise capable of 
concept searching. 

12. The method of claim 11, further comprising the 
following steps after the reading step: 

determining whether the query contains additional word 
tokens; and 

in response to determining that the query contains addi- 
tional word tokens, selecting the next word token in the 
query and repeating from the first determining step. 

13. A method for preprocessing a query comprising one 
or more word tokens, the method comprising the steps of: 

determining whether one of the word tokens in the query 
is contained in a concept database; 

in response to determining that the word token is 
contained in the concept database, reading a plurality of 
concept identifiers associated with the word token from 
the concept database, assigning each of the concept 
identifiers to concept tokens, and determining whether 
the query contains additional word tokens; 

in response to determining that the query contains 
additional word tokens, selecting the next word token in the 
query and repeating from the first determining step; and 

in response to determining that the query does not contain 
additional word tokens, assigning each concept token a 
normalized weight based upon the number of 
occurrences of each of the concept tokens, arranging each of 
the concept tokens according to the value of the 
normalized weights associated with the concept tokens, 
and passing the concept tokens and normalized weights 
to the search engine. 

14. The method of claim 13, wherein the arranging step 
further comprises removing concept tokens whose normal- 
ized weights are less than a threshold value. 

15. A computer apparatus for preprocessing a document 
comprising one or more word tokens, the computer appa- 
ratus comprising: 

a processor; 

a storage unit coupled to the processor, the storage unit 
maintaining the document and a concept database com- 
prising a plurality of word tokens associated with a 
plurality of concept identifiers; 

a memory coupled to the processor; 

the processor being operative to read one of the word 
tokens from the document; 

determine whether the word token is contained in the 
concept database; 

in response to determining that the word token is con- 
tained in the concept database, said processor operative 
to 

read concept identifiers associated with the word token 
from the concept database, 

to read a numerical weight associated with the word token 
from said concept database, 

to add the numerical weight to the sum of any numerical 
weights for any previous word tokens associated with 
said plurality of concept identifiers to create a sum of 
word token weights for each of said plurality of concept 
identifiers, 

and to determine whether the document contains addi- 
tional word tokens; 

in response to determining that the document contains 
additional word tokens, said processor operative to read 
the next word token from said document and repeat 
from the first determining step; and 

in response to determining that the document does not 
contain additional word tokens, said processor 
operative 

to normalize the sums of word token weights for each of 
the plurality of concept identifiers, 

to arrange each of said plurality of concept identifiers 
according to the value of said normalized sums of word 
token weights, 

to convert each of said plurality of concept identifiers to 
unique concept tokens, 

and to embed the concept tokens in the document. 

16. A computer apparatus for preprocessing a query 
comprising one or more word tokens, the computer appa- 
ratus comprising: 

a processor; 

a storage unit coupled to the processor, the storage unit 
maintaining the query and a concept database compris- 
ing a plurality of word tokens associated with a plu- 
rality of concept identifiers; 

a memory coupled to the processor; 

the processor being operative to 
read one of the plurality of word tokens from the query; 

determine whether the word token is contained in the 
concept database; 

in response to determining that the word token is con- 
tained in the concept database, said processor operative 
to 

read concept identifiers associated with the word token 
from the concept database, 

to assign each of the concept identifiers to unique concept 
tokens, and 

to determine whether the query contains additional word 
tokens; 

in response to determining that the query contains 
additional word tokens, said processor operative to read the 
next word token contained in said query and repeat 
from the first determining step; and 

in response to determining that the query does not contain 
additional word tokens, said processor operative to 
assign each of the concept tokens a normalized weight 
based upon the number of occurrences of each of the 
concept tokens, 

to arrange each of the concept tokens according to the 
value of the normalized weights associated with the 
concept tokens, and 

to transmit the concept tokens and the normalized weights 
to the search engine. 